High-dimensional analysis with pytometry¶

Table of contents¶

  1. What is pytometry?
  2. Reading in data
  3. Transforming channel values
  4. Adding sample metadata
  5. Batch alignment
    1. Preparing the batch control data
    2. Training the model and aligning the data
    3. Visualising the alignment
  6. Clustering
  7. Annotating our clusters
  8. Exporting the data
  9. Statistical analysis
    1. Generating summary statistics
    2. Plotting heatmap of summary statistics
    3. Principal component analysis of samples
    4. Pairiwse comparisons
    5. Filtering comparisons of interest

What is pytometry?¶

Pytometry is a Python package that provides tools for reading in flow cytometry data and applying pre-processing tools. While it provides methods for compensation and transformation (just like flowkit), it's focus is on the high-dimensional and unsupervised analysis of cytometry data. 'Pytometry' uses a data structure called AnnData ("Annotated Data") that is commonly used for single cell analysis, most notably the scanpy package often used for RNAseq analysis. This allows us to apply algorithms and visualisations to our cytometry data from the transcriptomics world. The AnnData structure may seem a little intimiating at first, but we'll learn as we go, and hopefully you'll see it's just a collection of matrices/DataFrames of different kinds of data, all linked back to the data by the sample metadata.

In the diagram below (from the AnnData package's documentation):

  • X is the expression matrix where each row is an event and each column is a variable
  • var is a DataFrame of metadata about each variable in X
  • obs is a DataFrame of metadata about each observation (event) in X
  • uns is a dictionary of unstructured metadata that doesn't fit anywhere else

There are some other subfields in the diagram that either aren't relevant for not, or we'll come back to them later.

annData

Note: pytometry assumes your data have already been cleaned of debris, dead cells, and doublets. Most people will find it easiest to do this using software like FlowJo and exporting cleaned .fcs or .csv files. Alternatively this can be done in flowkit as in the previous tutorial.

We start by importing the packages we're going to need. There are quite a lot of them in this tutorial so you may wish to take note of the alias we use for each. Don't forget that before using these packages you will need to install them, for example using pip install bokeh pytometry in a terminal/powershell.

In [ ]:
import scanpy as sc                          # for single cell analysis
import anndata as ann                        # for annotated data structure
import cytonormpy as cnp                     # for correcting batch effects
import numpy as np                           # for numpy arrays
import pandas as pd                          # for pandas dataframes
import matplotlib.pyplot as plt              # for plotting
import seaborn as sns                        # for more plotting functions
import joypy                                 # for making ridgeplots
import pytometry as pm                       # for high-dim cytometry functions
import os                                    # for working with directories
from sklearn.decomposition import PCA        # for principal component analysis
import scipy                                 # for stats functions

Reading in data¶

For this tutorial, you should have the 16 example .fcs files in the folder data/clean from your working/project directory. You can check this using the os.listdir() function as below. You should see the same output as shown here.

In [ ]:
path = 'data/clean/'
files = os.listdir(path)
files
Out[ ]:
['Mock_01_A.fcs',
 'Mock_02_A.fcs',
 'Mock_03_A.fcs',
 'Mock_04_A.fcs',
 'Mock_05_B.fcs',
 'Mock_06_B.fcs',
 'Mock_07_B.fcs',
 'Mock_08_B.fcs',
 'sample_details.csv',
 'Virus_01_A.fcs',
 'Virus_02_A.fcs',
 'Virus_03_A.fcs',
 'Virus_04_A.fcs',
 'Virus_05_B.fcs',
 'Virus_06_B.fcs',
 'Virus_07_B.fcs',
 'Virus_08_B.fcs']

We'll make use of that .csv file later, but for now let's get a list containing just the .fcs file names.

In [ ]:
files = [fileID for fileID in files if fileID.endswith(".fcs")]
files
Out[ ]:
['Mock_01_A.fcs',
 'Mock_02_A.fcs',
 'Mock_03_A.fcs',
 'Mock_04_A.fcs',
 'Mock_05_B.fcs',
 'Mock_06_B.fcs',
 'Mock_07_B.fcs',
 'Mock_08_B.fcs',
 'Virus_01_A.fcs',
 'Virus_02_A.fcs',
 'Virus_03_A.fcs',
 'Virus_04_A.fcs',
 'Virus_05_B.fcs',
 'Virus_06_B.fcs',
 'Virus_07_B.fcs',
 'Virus_08_B.fcs']

The next step is to read these .fcs files into a list of AnnData objects. First we create an empty list that will hold the data, then we iterate over each filename, reading in the data with the pm.io.read_fcs() function, removing unwanted parameters with the pm.pp.split_signal() function, then adding that data to the list using the append() method. If you call ?pm.pp.split_signal you'll see this function removes any non-area parameters if data_type = 'facs' and removes any non-element parameters if data_type = 'cytof'.

In [ ]:
adatas = []

for fileID in files:

    adata = pm.io.read_fcs(path + fileID)

    pm.pp.split_signal(
        adata, 
        var_key   = "channel", 
        data_type = "facs"
    )

    adatas.append(adata)

We can subset adatas just like any list. If we look at just the first element, we're told it's an AnnData object with 10,000 events and 8 parameters.

In [ ]:
adatas[0]
Out[ ]:
AnnData object with n_obs × n_vars = 10000 × 8
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    uns: 'meta'

We're also told this object has a DataFrame of variable-level metadata (var), and some unstructured metadata (uns). We can access the variable-level metadata by calling the var attribute, and we see we get a DataFrame with some information about each parameter in the original .fcs file. The unstructured metadata just contains the original .fcs file keyword data, and can be accessed using the .uns attribute.

In [ ]:
adatas[0].var
Out[ ]:
n channel marker $PnB $PnE $PnG $PnR signal_type
CD3e 1 PECy5-5-A CD3e 32 0,0 1.0 262144 area
CD16-32 2 PECy7-A CD16-32 32 0,0 1.0 262144 area
Ly6G 3 DL800-A Ly6G 32 0,0 1.0 262144 area
CD45 4 AF700-A CD45 32 0,0 1.0 262144 area
CD48 5 APCCy7-A CD48 32 0,0 1.0 262144 area
CD11b 6 BUV395-A CD11b 32 0,0 1.0 262144 area
B220 7 BUV737-A B220 32 0,0 1.0 262144 area
Ly6C 8 BV605-A Ly6C 32 0,0 1.0 262144 area
In [ ]:
adatas[0].uns
Out[ ]:
OrderedDict([('meta',
              {'__header__': {'FCS format': 'FCS3.1',
                'text start': 256,
                'text end': 909,
                'data start': 910,
                'data end': 320909,
                'analysis start': 0,
                'analysis end': 0},
               '$BEGINANALYSIS': '0',
               '$BEGINDATA': '910',
               '$BEGINSTEXT': '0',
               '$BYTEORD': '1,2,3,4',
               '$DATATYPE': 'F',
               '$ENDANALYSIS': '0',
               '$ENDDATA': '320909',
               '$ENDSTEXT': '0',
               '$MODE': 'L',
               '$NEXTDATA': 0,
               '$PAR': 8,
               '$TOT': 10000,
               'channels':         $PnN     $PnS  $PnB $PnE $PnG    $PnR
               n                                            
               1  PECy5-5-A     CD3e    32  0,0  1.0  262144
               2    PECy7-A  CD16-32    32  0,0  1.0  262144
               3    DL800-A     Ly6G    32  0,0  1.0  262144
               4    AF700-A     CD45    32  0,0  1.0  262144
               5   APCCy7-A     CD48    32  0,0  1.0  262144
               6   BUV395-A    CD11b    32  0,0  1.0  262144
               7   BUV737-A     B220    32  0,0  1.0  262144
               8    BV605-A     Ly6C    32  0,0  1.0  262144,
               'header': {'FCS format': 'FCS3.1',
                'text start': 256,
                'text end': 909,
                'data start': 910,
                'data end': 320909,
                'analysis start': 0,
                'analysis end': 0}})])

At present our data are stored as a list of AnnDatas but it will be more convenient if we can merge these into a single master AnnData. We can achieve this using the .concat() method. The label argument defines the name of the new column in the observation-level metadata (obs) that will indicate the filename each event belongs to, and the keys argument defines what these values will be. The merge = 'same' argument tells the function to keep only the var and uns metadata if it is the same for all samples. This ensures that all samples have the same parameters, but results in the uns matrix being dropped as this is different for each sample (but we don't mind that).

In [ ]:
adata_all = ann.concat(
    adatas, 
    label = 'filename', 
    keys  = files, 
    merge = 'same'
)

adata_all
c:\Python_3.10.2\lib\site-packages\anndata\_core\anndata.py:1818: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
Out[ ]:
AnnData object with n_obs × n_vars = 160000 × 8
    obs: 'filename'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'

Let's take a look at this observation-level metadata I just mentioned. We can view it by calling the obs attribute of our object. At present we just have the filename column we created to indicate the file each even belongs to. Later we will add more metadata like experimental group and batch.

In [ ]:
adata_all.obs
Out[ ]:
filename
0 Mock_01_A.fcs
1 Mock_01_A.fcs
2 Mock_01_A.fcs
3 Mock_01_A.fcs
4 Mock_01_A.fcs
... ...
9995 Virus_08_B.fcs
9996 Virus_08_B.fcs
9997 Virus_08_B.fcs
9998 Virus_08_B.fcs
9999 Virus_08_B.fcs

160000 rows × 1 columns

As a quick sense-check, let's see how many events we have for each value of the filename column in obs:

In [ ]:
adata_all.obs['filename'].value_counts()
Out[ ]:
Mock_01_A.fcs     10000
Mock_02_A.fcs     10000
Mock_03_A.fcs     10000
Mock_04_A.fcs     10000
Mock_05_B.fcs     10000
Mock_06_B.fcs     10000
Mock_07_B.fcs     10000
Mock_08_B.fcs     10000
Virus_01_A.fcs    10000
Virus_02_A.fcs    10000
Virus_03_A.fcs    10000
Virus_04_A.fcs    10000
Virus_05_B.fcs    10000
Virus_06_B.fcs    10000
Virus_07_B.fcs    10000
Virus_08_B.fcs    10000
Name: filename, dtype: int64

We can see we have 10,000 events per file. We won't always have the same number of events per sample, but it's good to check we have what we expect, and to ensure we don't have files with too few events.

Transforming channel values¶

If your data were exported from FlowJo, then all the parameters will be linear. In order for our dimension reduction and clustering algorithms to provide meaningful insights, we need to transform our fluorescence/metal parameters. Pytometry provides a few common transformations such as biexponential, logicle, and asinh. The asinh (inverse hyperbolic sin) transformation is almost exclusively used for mass cytometry data, while any of these transformations may be applied to fluorescence cytometry data. You can experiment with different transformations, but for today we will use biexponential.

We do that using the normalize_biexp() function, giving the AnnData object, and s eries of parameters to the transformation. You can experiment with these paremeters, but I took these particular values directly from FlowJo. There is also a normalize_autologicle() function for automatically choosing the parameters of a logicle transformation, but I'll let you experiment with that one. These functions modify the values of the axisting AnnData object in place.

In [ ]:
pm.tl.normalize_biexp(
    adata_all,
    negative  = 0,
    width     = -100,
    positive  = 3.71,
    max_value = 65536
)

#pm.tl.normalize_autologicle(adata_all)

To check our choice of transformation is suitable, it’s a good idea to plot each transformed parameter. A quick way to check if our negative populations are squeezed or split by the transformation, is to create ridge plots (sometime called joyplots). We do this using the joyplot() function from the package of the same name. This function doesn't accept AnnData objects as input, so we convert the expression data to a DataFrame using to_df() then randomly sample 10,000 events to speed the plotting up. To the column argument we give a list of variable names we access from the var_names attribute (converting it to a list as is expected by joyplot()), where every variable will be plotted as a separate distribution. The overlap argument just controls how much the distributions overlap and you can experiment with this.

I start by using the do.subsample() function to randomly sample 10,000 events from the data (just to speed things up). Next, I use lapply() to apply a function to every element in our transformed_cols vector. The function I apply is an anonymous function that calls make.colour.plot() on the data, iterating over each element of the transformed_cols vector as the x variable, and fixing “AF700_CD45_asinh” as the y variable.

In [ ]:
p = joypy.joyplot(
    adata_all.to_df().sample(10_000),
    column  = adata.var_names.to_list(),
    overlap = 1.5
)
No description has been provided for this image

Those negative populations look ok, but looking at bivariate plots is a good idea too. Below, we use seaborn's histplot() function for plotting a 2D histogram of CD45 against every other parameter. We extract the expression data as a numpy array by calling the X (capital X) attribute from our AnnData object (or we could have extracted it as a DataFrame as we did above). We iterate over the parameters, subsetting for the column for the current parameter.

In [ ]:
for idx, name in enumerate(adata_all.var_names):
    p = sns.histplot(
        x = adata_all.X[:, idx],
        y = adata_all.X[:, 3],
        bins = 200,
        cmap = 'viridis'
    )

    plt.xlabel(name)
    plt.ylabel('CD45')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

After inspecting each of the plots, we may wish to go back and change our transformation. This transformation looks fine, so we’ll continue.

Adding sample metadata¶

Often, we will have sample metadata we wish to include in our analysis that couldn’t read in as part of our .fcs files. This might include sample identifiers and grouping variables such as treatment, concentration, and time. It’s easy to add this information to our AnnData object by manually creating a .csv file with the relevant information, and merging this with the obs DataFrame.

In your data/clean/ directory, you should have the file "sample_details.csv" that contains the metadata for this example. If you open the file as a spreadsheet, you'll see it contains the columns:

  • filename
  • sample
  • group
  • batch_label
  • batch
  • reference

It doesn’t need all of these, and could contain more, depending on what variables you want to use to annotate your samples (you could include donor age, for example). However note that (at the time of writing this) the cytonorm algorithm does expect the column batch to contain only integers and the column reference to containg the value 'ref' for reference controls, and 'other' for non-reference controls (more on all of this later).

We read this .csv file in as a DataFrame and merge it with the obs DataFrame, matching the metadata between the files using the filename column. You can think of it as taking each row in obs in turn, looking at the filename, then grabbing the metadata corresponding to that filename in metadata.

In [ ]:
metadata = pd.read_csv('data/clean/sample_details.csv')
metadata
Out[ ]:
filename sample group batch_label batch reference
0 Mock_01_A.fcs Mock_01_A Mock A 1 ref
1 Mock_02_A.fcs Mock_02_A Mock A 1 other
2 Mock_03_A.fcs Mock_03_A Mock A 1 other
3 Mock_04_A.fcs Mock_04_A Mock A 1 other
4 Virus_01_A.fcs WNV_01_A Virus A 1 other
5 Virus_02_A.fcs WNV_02_A Virus A 1 other
6 Virus_03_A.fcs WNV_03_A Virus A 1 other
7 Virus_04_A.fcs WNV_04_A Virus A 1 other
8 Mock_05_B.fcs Mock_05_B Mock B 2 ref
9 Mock_06_B.fcs Mock_06_B Mock B 2 other
10 Mock_07_B.fcs Mock_07_B Mock B 2 other
11 Mock_08_B.fcs Mock_08_B Mock B 2 other
12 Virus_05_B.fcs WNV_05_B Virus B 2 other
13 Virus_06_B.fcs WNV_06_B Virus B 2 other
14 Virus_07_B.fcs WNV_07_B Virus B 2 other
15 Virus_08_B.fcs WNV_08_B Virus B 2 other
In [ ]:
adata_all.obs = pd.merge(adata_all.obs, metadata, on = 'filename', how = 'left')
c:\Python_3.10.2\lib\site-packages\anndata\_core\anndata.py:767: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    [0, 1, 2, 3, 4]

    Inferred to be: integer

  value_idx = self._prep_dim_index(value.index, attr)

We get a little warning from AnnData telling us it likes the index (the rownames) of obsto be strings, but the merge() function returned integers. We fix that easily below and look at our newly added metadata.

In [ ]:
adata_all.obs.index = adata_all.obs.index.astype(str)
adata_all.obs
Out[ ]:
filename sample group batch_label batch reference
0 Mock_01_A.fcs Mock_01_A Mock A 1 ref
1 Mock_01_A.fcs Mock_01_A Mock A 1 ref
2 Mock_01_A.fcs Mock_01_A Mock A 1 ref
3 Mock_01_A.fcs Mock_01_A Mock A 1 ref
4 Mock_01_A.fcs Mock_01_A Mock A 1 ref
... ... ... ... ... ... ...
159995 Virus_08_B.fcs WNV_08_B Virus B 2 other
159996 Virus_08_B.fcs WNV_08_B Virus B 2 other
159997 Virus_08_B.fcs WNV_08_B Virus B 2 other
159998 Virus_08_B.fcs WNV_08_B Virus B 2 other
159999 Virus_08_B.fcs WNV_08_B Virus B 2 other

160000 rows × 6 columns

Batch alignment¶

If samples were not all stained and collected together, it’s possible that batch effects exist in the data that contribute added noise. Preventing batch effects before they occur is the best strategy, but if samples must be collected at distant time points, then we can mitigate their impact by including a common sample to each round of staining and acquisition, and using the CytoNorm algorithm to align the batches based on these batch control files.

Let's start by looking for evidence of a batch effect between our two batches. A fairly effective way of doing this is to find a lower-dimensional representation of the data, and visualize it to see if batches are overlapping. We'll use the UMAP dimension reduction algorithm for this.

To speed things up, we start by taking a random sample of 50,000 events from our AnnData object and storing this in a new AnnData object called adata_sub. Before running UMAP, we must calculate the distance between each event and its nearest neighbors using sc.pp.neighbors(). We then calculate the UMAP embedding with sc.tl.umap(), where min_dist controls the granularity of the final embedding. Finally, we can plot the embedding using sc.pl.umap(), giving a list of variable sfrom obs we wish to colour by (each resulting in a subplot).

In [ ]:
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)

sc.pp.neighbors(adata_sub)

sc.tl.umap(adata_sub, min_dist = 0.1)

sc.pl.umap(adata_sub, color = ['batch_label', 'group', 'sample'], size = 3, ncols = 2)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
No description has been provided for this image

The most salient plot at the moment is the one showing batch_label, and we can see that the two batches do not overlap (remember these data are from the same biological sample and should be near identical). Let's proceed with performing batch correction to reduce the impact of this batch effect.

Preparing the batch control data¶

We start by creating a CytoNorm object, and adding a FlowSOM clusterer that will partition the data into 8 clusters. The cytonorm algorithm relies on first clustering the data so that each cluster can be aligned separately, and we only need to capture the broad cell types for now.

In [ ]:
cn = cnp.CytoNorm()
fs = cnp.FlowSOM(n_clusters = 5)
cn.add_clusterer(fs)

It's useful at this point to introduce AnnData's layers functionality. As we work with an AnnData object we may make changes to the data X, for example in transforming it or normalising it, and it can be helpful to retain multiple versions of the data, rather than just keep overwriting it. We can do this by creating layers within our AnnData object. We'll do this now so we can have seperate layers for the original, and normalised expression data.

To add a new layer, we use the layers attribute and use square brackets as if we're adding a new column to a DataFrame. In the example below, I just copy the existing values of X. When we call adata_all, we can see we now have a layer called 'original'.

In [ ]:
adata_all.layers['original'] = adata_all.X.copy()
adata_all
Out[ ]:
AnnData object with n_obs × n_vars = 160000 × 8
    obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    layers: 'original'

Next, we run the run_anndata_setup() method, which lets us define which layer we're going to use as the input data, the name of the layer that will contain the normalised data, and the columns in obs that indicate the batch and sample of each file and whether it is a reference control or not.

In [ ]:
cn.run_anndata_setup(
    adata_all, 
    layer                    = 'original', 
    reference_column         = 'reference',
    batch_column             = 'batch',
    sample_identifier_column = 'sample',
    key_added                = 'normalised'
)

Looking at adata_all again shows an additional layer that will eventually contain the normalised data (it is just a copy of original at present).

In [ ]:
adata_all
Out[ ]:
AnnData object with n_obs × n_vars = 160000 × 8
    obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    layers: 'original', 'normalised'

Now we can run the clustering we defined earlier.

In [ ]:
cn.run_clustering()

Training the model and aligning the data¶

We then calculate quantiles per marker, per cluster, and calculate splines that map each of these distributions to the mean distributions of the batch control files. The normalize_data() method then aligns the data based on the trained model.

In [ ]:
cn.calculate_quantiles()
cn.calculate_splines(goal = "batch_mean")
cn.normalize_data()
normalized file Mock_07_B
normalized file Mock_04_A
normalized file Mock_06_B
normalized file WNV_02_A
normalized file Mock_03_A
normalized file Mock_02_A
normalized file WNV_01_A
normalized file Mock_08_B
normalized file WNV_03_Anormalized file WNV_05_B

normalized file WNV_04_A
normalized file WNV_06_B
normalized file WNV_07_B
normalized file WNV_08_B

If we want to normalise any files not included in the initial training, or normalize the batch control files themselves, we need to give a list of file names and corresponding list of batch memberships the the normalize_data() method.

In [ ]:
cn.normalize_data(file_names = ['Mock_01_A', 'Mock_05_B'], batches = [1, 2])
normalized file Mock_01_A
normalized file Mock_05_B

Visualising the alignment¶

It's a good idea to check if we're happy cytonorm's clustering has captured the broad cell types, and that the batch normalisation has been effective.

First, we access the _clustering attribute of our cn object and apply its calculate_clusters() method to the expression matrix to get an array of event cluster memberships. We then convert the cluster memberships to strings (to improve plotting) and store them as a column in obs.

In [ ]:
cn_clusters = cn._clustering.calculate_clusters(adata_all.layers['original'])
cn_clusters
Out[ ]:
array([4, 4, 3, ..., 1, 1, 3], dtype=int64)
In [ ]:
adata_all.obs['cytonorm_cluster'] = cn_clusters.astype('str')
adata_all.obs
Out[ ]:
filename sample group batch_label batch reference cytonorm_cluster
0 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4
1 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4
2 Mock_01_A.fcs Mock_01_A Mock A 1 ref 3
3 Mock_01_A.fcs Mock_01_A Mock A 1 ref 3
4 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4
... ... ... ... ... ... ... ...
159995 Virus_08_B.fcs WNV_08_B Virus B 2 other 3
159996 Virus_08_B.fcs WNV_08_B Virus B 2 other 2
159997 Virus_08_B.fcs WNV_08_B Virus B 2 other 1
159998 Virus_08_B.fcs WNV_08_B Virus B 2 other 1
159999 Virus_08_B.fcs WNV_08_B Virus B 2 other 3

160000 rows × 7 columns

We're going to calculate another UMAP embedding of the data to visualise the impact of batch normalisation, as well as to colou by membership to cytonorm's clusters, to decide whether we think it's capturing the broad cell types present.

The sc.pp.neighbors() function we need to use to calculate the nearest neighbour distances, doesn't take a layer as input, but it can take a matrix from our AnnData's obsm (obsersation matrix) subfield. This subfield is for storing observation level data that may have many dimensions to it. Because we want to use the normalised layer data to compute the UMAP, we simply copy this to a obsm slot.

In [ ]:
adata_all.obsm['normalised'] = adata_all.layers['normalised']
adata_all.obsm
Out[ ]:
AxisArrays with keys: normalised

Now we take another subsample, calculate the nearest neighbour distances, calculate the UMAP embedding, and plot the data, colouring by cytonorm_cluster, batch_label, and group.

In [ ]:
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)

sc.pp.neighbors(adata_sub, use_rep = 'normalised')

sc.tl.umap(adata_sub, min_dist = 0.1)

sc.pl.umap(
    adata_sub, 
    layer = 'normalised', 
    color = ['cytonorm_cluster', 'batch_label', 'group'], 
    size  = 3, 
    ncols = 1
)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
No description has been provided for this image

We can see the batches are now much more superimposed over each other. The cytonorm clusters do an ok job of capturing the broad cell types. But while looking an a UMAP gives a quick overview, it's a good idea to compare the parameter distributions pre-and post normalisation for the batch control files. First, let's create a new AnnData object containing just the data from the batch control samples.

In [ ]:
adata_ref = adata_all[adata_all.obs['reference'] == 'ref']
adata_ref
Out[ ]:
View of AnnData object with n_obs × n_vars = 20000 × 8
    obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    obsm: 'normalised'
    layers: 'original', 'normalised'

Next, we create separate DataFrames containing the original and normalised expression data. We add columns indicating whether the data are original or normalised, and the batch number for each event. Finally, we concatenate these together into a single DataFrame called both.

In [ ]:
orig = adata_ref.to_df(layer = 'original')
orig['group'] = 'original'
orig['batch'] = adata_ref.obs.batch

norm = adata_ref.to_df(layer = 'normalised')
norm['group'] = 'normalised'
norm['batch'] = adata_ref.obs.batch

both = pd.concat([orig, norm])
both
Out[ ]:
CD3e CD16-32 Ly6G CD45 CD48 CD11b B220 Ly6C group batch
0 1141.502801 1884.283173 631.694678 2096.491398 2345.868626 1541.958951 3431.587756 1197.512280 original 1
1 1146.759063 1001.825654 911.361470 1859.862764 1385.111571 1323.273261 3219.384459 1711.602316 original 1
2 1147.575687 2237.875240 844.361401 2068.902565 1974.999994 2711.006654 1238.627796 2597.616532 original 1
3 1329.867442 2329.289555 791.594765 1443.450031 2116.643027 3098.270045 1792.858119 2867.574559 original 1
4 1043.718253 2040.863053 342.438555 1501.455922 2602.254838 2053.469957 2690.738776 1301.960150 original 1
... ... ... ... ... ... ... ... ... ... ...
49995 1093.597985 1253.195038 -48.364905 2676.159933 2916.001651 1291.851753 3187.865781 1560.876894 normalised 2
49996 1198.776982 2779.748414 3323.263529 2151.755543 1519.962398 3228.117012 1025.970601 2422.467398 normalised 2
49997 1473.892663 2433.838854 1060.009231 1707.809461 1881.749736 2775.319616 1575.442517 2821.307012 normalised 2
49998 1307.985890 2601.736766 2882.050968 2027.890835 929.216557 3308.465222 1435.532210 2285.653496 normalised 2
49999 1088.871324 2792.563676 3106.601275 1940.889623 1197.101574 3130.648391 1129.639749 2035.476600 normalised 2

40000 rows × 10 columns

Now we have this DataFrame of annotated, batch control data, we use the joyplot() function from the package of the same name, to visualise the parameter distributions. As you scan through, you'll noice some discrepancies between the distributions pre-normalisation, that are corrected by cytonorm.

In [ ]:
for param in adata.var_names.to_list():
    p = joypy.joyplot(
        both,
        title   = param,
        by      = ['group', 'batch'],
        column  = param,
        overlap = 1.5,
        grid    = True,
        color   = ['#ABDDA4', '#ABDDA4', '#FDAE61', '#FDAE61']
    )
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

If we weren't happy with the normalisation, we could go back and change the number of clusters, for example. But this looks as though the batch effect has been removed, so now let's replace the X expression matrix in our AnnData object with the normalised expression matrix.

In [ ]:
adata_all.X = adata_all.layers['normalised']

Clustering¶

To help us identify clusters or populations of cells in our data, we can employ a clustering algorithm to partition the events into groups that are more similar to each other than they are to events in other groups. Traditional gating is just "clustering by hand", but can be subjective and miss important populations when the parameter space is large.

The pytometry package currently only has an implementation of the FlowSOM clustering algorithm (which is probably the most widely used), but there are Python implementations of other clustering algorithms you could apply, such as phenograph. FlowSOM and phenograph are state of the art and perform well for most cytometry clustering problems. In this example we'll use FlowSOM.

Performing FlowSOM clustering is easy: we simply provide the AnnData object as the first argument, the name of the column that will indicate cluster membership in obs, and arguments that control the clustering algorithm.

The som_dim controls the dimensions of the self organizing map . The greater the dimensions of the SOM, the more initial clusters the data are broken up into. The default is (10, 10), but here I set it to (15, 15) as it seems to result in better clustering. We can set the random seed to ensure reproducibility (pick your favourite number of leave the default of 42), and the min_clusters and max_clusters arguments. If these have the same value, we are essentially instructing FlowSOM to return exactly that number of metaclusters. Otherwise, we can give the algorithm different minimum and maximum numbers, and let it choose for us (it tends to under cluster). It is usually best to set the number of metaclusters a little higher than the number of populations you think may be identifiable in the data (as we can easily merge clusters). We can alter the number of metaclusters later if we believe we are under or over clustering.

In [ ]:
pm.tl.flowsom_clustering(
    adata_all,
    key_added    = 'fsom_metacluster',
    som_dim      = (15, 15),
    seed         = 24601,
    min_clusters = 15,
    max_clusters = 15,
    verbose      =  True
)
 [ 500 / 500 ] 100% - 0:00:00 left 
 quantization error: 650.0046638720019
Computing consensus matrices:   0%|          | 0/1 [00:00<?, ?it/s]
c:\Python_3.10.2\lib\site-packages\kneed\knee_locator.py:225: RuntimeWarning: invalid value encountered in divide
  return (a - min(a)) / (max(a) - min(a))
c:\Python_3.10.2\lib\site-packages\kneed\knee_locator.py:208: RuntimeWarning: Mean of empty slice.
  self.S * np.abs(np.diff(self.x_normalized).mean())
c:\Python_3.10.2\lib\site-packages\numpy\core\_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
c:\Python_3.10.2\lib\site-packages\consensusclustering\consensus.py:464: UserWarning: Kneedle algorithm failed to find a knee. Returning maximum number of clusters, however, it is likely that the clustering is unstable. Plot the CDFs and consensus matrices to check.
  warn(
Assigning cluster labels to cells:   0%|          | 0/160000 [00:00<?, ?it/s]
Out[ ]:
AnnData object with n_obs × n_vars = 160000 × 8
    obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster', 'fsom_metacluster'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    obsm: 'normalised'
    layers: 'original', 'normalised'

We can access the cluster memberships from the fsom_metacluster column from obs. Below we just show we do indeed have 15 metaclusters.

In [ ]:
len(adata_all.obs['fsom_metacluster'].unique())
Out[ ]:
15

Let's subsample the data and calculate the UMAP embedding again, so we can colour the points by metacluster.

In [ ]:
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)

sc.pp.neighbors(adata_sub)

sc.tl.umap(adata_sub, min_dist = 0.1)

sc.pl.umap(adata_sub, color = 'fsom_metacluster', size = 8)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
No description has been provided for this image

Finally, to help us assign cell types to our clusters ("annotating" the clusters), let’s plot the UMAP embeddings, but faceting by antigen, and colouring by expression of that antigen.

In [ ]:
sc.pl.umap(
    adata_sub,
    gene_symbols = 'marker',
    color        = adata_sub.var_names,
    ncols        = 3
)
No description has been provided for this image

We can also create bivariate plots, colouring by metacluster, as this may help us when annotating the clusters later. Below we plot every parameter against CD45.

In [ ]:
for param in adata_sub.var_names:
    sc.pl.scatter(adata_sub, x = param, y = 'CD45', color = 'fsom_metacluster', size = 8, )
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Annotating our clusters¶

The process of assigning a biological cell type to each cluster is usually the hardest and most labour intensive part of the analysis. To help with this, in addition to the UMAP plots we created a moment ago, scanpy provides us with a few different visualisation tools. You may not like all of them, but we'll look at them here one by one so you have a feeel of what's available.

Let's start with what scanpy calls a dotplot, which is a heatmap where the size of each "dot" is mapped to an estimate of the proportion of events positive for each marker. This is calculated based on a single expression cutoff value, which for the example we set to 2000.

In [ ]:
sc.pl.dotplot(
    adata_all, 
    var_names         = adata_all.var_names, 
    groupby           = 'fsom_metacluster', 
    dendrogram        = True,
    expression_cutoff = 2000
)
WARNING: dendrogram data not found (using key=dendrogram_fsom_metacluster). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)
No description has been provided for this image

Another way to visualize metacluster expression is through violin plots. Here we iterate through each parameter and visualize its distribution per metacluster with violin plots. Setting stripplot = False suppresses the drawing of individual events, and speeds up plotting. These plots are also useful for identifying under-clustering, where we may see bimodal or even multimodal parameter distributions.

In [ ]:
for param in adata_all.var_names:
    sc.pl.violin(
    adata_all, 
    keys      = param, 
    groupby   = 'fsom_metacluster', 
    rotation  = 90,
    stripplot = False    
)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

But the visualisation many people are more comfortable with is the humble heatmap, which scanpy calls a matrixplot.

In [ ]:
sc.pl.matrixplot(
    adata_all, 
    var_names  = adata_all.var_names, 
    groupby    = 'fsom_metacluster', 
    dendrogram = True
)
No description has been provided for this image

The heatmap above is displaying the mean of the transformed parameter values per metacluster. But it's also useful to scale the expression values between 0 and 1 within each column (parameter) first, to visualise the metaclusters expressing the most and least amount of each antigen. We do this by providing standard_scale = 'var' (to do the same within each column, set this argument to 'group', but this isn't as useful).

In [ ]:
sc.pl.matrixplot(
    adata_all, 
    var_names      = adata_all.var_names, 
    groupby        = 'fsom_metacluster', 
    dendrogram     = True,
    standard_scale = 'var',
    colorbar_title = 'Column scaled\nexpression'
)
No description has been provided for this image

The visualization scanpy calls a heatmap is similar to the visualization above, except that every individual event is included in the plot. The only useful thing about this is that we can clearly see the proportion of total each metacluster represents, and we can use this information to inform cell type annotations.

In [ ]:
sc.pl.heatmap(
    adata_all, 
    var_names      = adata_all.var_names, 
    groupby        = 'fsom_metacluster', 
    standard_scale = 'var',
    dendrogram     = True
)
No description has been provided for this image

Armed with our visualisations above, now we need to start assigning cell types to our FlowSOM metaclusters. Note that it’s better to over cluster than to under cluster, as at this point we can merge clusters we believe represent the same cell type.

To annotate, we create a dictionary of key:value pairs, where the key of each element is the name of the metacluster, and the value of each element is cell type we wish to assign to that metacluster.

Once we have created our dictionary, we use the map() method on the fsom_metacluster column, which looks at each value of that column, finds the corresponding key in the dictionary we provide as an argument, and pulls out the value. This is then returned as a single series which we assign as a column of the obs subfield.

In [ ]:
annotations = {
    'cluster_0'  : 'Other/Unknown',
    'cluster_1'  : 'Monocytes',
    'cluster_2'  : 'Other/Unknown',
    'cluster_3'  : 'Monocytes',
    'cluster_4'  : 'Other/Unknown',
    'cluster_5'  : 'Mature neutrophils',
    'cluster_6'  : 'Immature neutrophils',
    'cluster_7'  : 'Mature neutrophils',
    'cluster_8'  : 'Plasma cells',
    'cluster_9'  : 'B cells',
    'cluster_10' : 'T cells',
    'cluster_11' : 'T cells',
    'cluster_12' : 'T cells',
    'cluster_13' : 'B cells',
    'cluster_14' : 'B cells',
}

adata_all.obs['population'] = adata_all.obs['fsom_metacluster'].map(annotations)
adata_all.obs
Out[ ]:
filename sample group batch_label batch reference cytonorm_cluster fsom_metacluster population
0 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4 cluster_14 B cells
1 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4 cluster_9 B cells
2 Mock_01_A.fcs Mock_01_A Mock A 1 ref 3 cluster_1 Monocytes
3 Mock_01_A.fcs Mock_01_A Mock A 1 ref 3 cluster_1 Monocytes
4 Mock_01_A.fcs Mock_01_A Mock A 1 ref 4 cluster_9 B cells
... ... ... ... ... ... ... ... ... ...
159995 Virus_08_B.fcs WNV_08_B Virus B 2 other 3 cluster_3 Monocytes
159996 Virus_08_B.fcs WNV_08_B Virus B 2 other 2 cluster_11 T cells
159997 Virus_08_B.fcs WNV_08_B Virus B 2 other 1 cluster_5 Mature neutrophils
159998 Virus_08_B.fcs WNV_08_B Virus B 2 other 1 cluster_5 Mature neutrophils
159999 Virus_08_B.fcs WNV_08_B Virus B 2 other 3 cluster_6 Immature neutrophils

160000 rows × 9 columns

So we can create a nice UMAP plot labelled with our cell types, let’s do the same for the adata_sub DataFrame, then plot the UMAP embedding with the cell type labels.

In [ ]:
adata_sub.obs['population'] = adata_sub.obs['fsom_metacluster'].map(annotations)
sc.pl.umap(adata_sub, color = 'population', legend_loc = 'on data')
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
No description has been provided for this image

Lastly, let’s recreate our expression heatmaps using our annotated cell types as rows, instead of the metaclusters.

In [ ]:
sc.pl.matrixplot(
    adata_all, 
    var_names      = adata_all.var_names, 
    groupby        = 'population', 
    dendrogram     = True,
    standard_scale = 'var',
    colorbar_title = 'Column scaled\nexpression'
)
WARNING: dendrogram data not found (using key=dendrogram_population). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
No description has been provided for this image
In [ ]:
sc.pl.heatmap(
    adata_all, 
    var_names      = adata_all.var_names, 
    groupby        = 'population', 
    standard_scale = 'var',
    dendrogram     = True
)
No description has been provided for this image

Exporting the data¶

Now that our original .fcs data are combined and annotated with the metadata and population data, it’s a good idea to save a snapshot so we can come back to where we left off in future. The write_csvs() method will write seperate .csv files for each subfield of our AnnData object, while the write() method stores the entire object as a .h5ad file, which is AnnData's file format. We can read in an AnnData object from its .h5ad file using the read_h5ad() function (I do this just as a demonstration below).

In [ ]:
adata_all.write_csvs('data/exported/annotated')
adata_all.write('data/exported/annotated.h5ad')
adata_2 = ann.read_h5ad('data/exported/annotated.h5ad')
adata_2
Out[ ]:
AnnData object with n_obs × n_vars = 160000 × 8
    obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster', 'fsom_metacluster', 'population'
    var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
    uns: 'dendrogram_fsom_metacluster', 'dendrogram_population', 'fsom_metacluster_colors', 'population_colors'
    obsm: 'normalised'
    layers: 'normalised', 'original'

Statistical analysis¶

So far we have partitioned our dataset into populations of cells. Now what? The next step in your analysis will depend on the questions you are asking, but a general starting point is to generate summary statistics of our data that include:

  • the proportion of all events in each population
  • the median fluorescence/mass intensity (MFI) of each antigen within population

Generating summary statistics¶

We can generate the percentage of total made up by each population in each sample, using the crosstab() function from Pandas. By giving the first and second arguments as the 'sample' and 'population' columns from obs, the function will return a DataFrame of counts (i.e. the number of cells of each type in each sample). By specifying normalize = 'index', we ask the function to convert those counts into a proportion that sums to 1 across each row. We then just multiply by 100 to turn this into a percentage. We then use reset_index() to have the 'sample' column (instead of as an index), and then just loop over the cell type columns, renaming them to include the 'percent_' prefix.

In [ ]:
per = pd.crosstab(adata_all.obs['sample'], adata_all.obs['population'], normalize = 'index') * 100
per.reset_index(inplace = True)
per.columns = [('percent_' + col if col != 'sample' else 'sample') for col in per.columns]
per
Out[ ]:
sample percent_B cells percent_Immature neutrophils percent_Mature neutrophils percent_Monocytes percent_Other/Unknown percent_Plasma cells percent_T cells
0 Mock_01_A 29.87 1.50 43.60 12.53 6.76 1.51 4.23
1 Mock_02_A 31.57 1.82 42.13 11.51 7.65 1.34 3.98
2 Mock_03_A 31.82 1.90 40.36 13.22 7.54 1.31 3.85
3 Mock_04_A 31.60 2.18 41.10 13.27 7.00 1.42 3.43
4 Mock_05_B 30.42 2.12 44.06 12.31 6.87 1.06 3.16
5 Mock_06_B 31.08 1.64 43.07 12.93 6.51 1.14 3.63
6 Mock_07_B 29.99 1.92 42.56 12.22 6.58 1.00 5.73
7 Mock_08_B 29.85 3.04 43.87 14.88 5.97 0.35 2.04
8 WNV_01_A 16.26 4.01 35.93 24.68 7.76 3.81 7.55
9 WNV_02_A 16.30 4.75 33.96 26.21 7.96 4.08 6.74
10 WNV_03_A 23.64 4.71 29.93 24.62 7.09 4.22 5.79
11 WNV_04_A 17.66 4.58 39.70 19.59 6.25 4.12 8.10
12 WNV_05_B 14.68 6.69 35.84 26.16 7.81 2.44 6.38
13 WNV_06_B 22.34 4.61 35.54 18.67 7.53 2.99 8.32
14 WNV_07_B 18.21 5.12 36.67 20.97 7.36 3.31 8.36
15 WNV_08_B 19.88 5.74 34.35 21.94 7.40 2.94 7.75

Creating a DataFrame of median intensities per sample and cell type is going to take a bit more work. We start by creating a DataFrame from the X matrix of our AnnData object, taking the var_names as the column names. We add columns to indicate which sample and population each event belongs to (converting the 'population' column to string to avoid issues later).

In [ ]:
df = pd.DataFrame(adata_all.X, columns = adata_all.var_names)
df['sample']     = adata_all.obs['sample'].values
df['population'] = adata_all.obs['population'].values
df['population'] = df['population'].astype(str)

df
Out[ ]:
CD3e CD16-32 Ly6G CD45 CD48 CD11b B220 Ly6C sample population
0 1131.903513 1872.116820 694.669156 2297.480400 2463.423637 1297.783650 3357.454591 1169.627907 Mock_01_A B cells
1 1137.770474 950.059162 904.626120 1983.218740 1594.213946 1108.404364 3150.453893 1655.764651 Mock_01_A B cells
2 1165.112501 2302.928927 874.856097 2195.772398 1992.812869 2640.528234 1279.057770 2509.351436 Mock_01_A Monocytes
3 1396.375650 2399.815210 821.163524 1567.501324 2120.240060 3068.668942 1869.271736 2770.973644 Mock_01_A Monocytes
4 1029.059341 2040.455759 475.916973 1636.949269 2727.709972 1732.400436 2653.900143 1262.789876 Mock_01_A B cells
... ... ... ... ... ... ... ... ... ... ...
159995 1163.559107 2029.841171 1002.002421 2102.715461 2934.872918 1585.463913 1251.227131 2969.636383 WNV_08_B Monocytes
159996 2205.121457 1106.366082 393.125111 3449.508003 2831.568270 1406.120611 1439.992540 1220.899638 WNV_08_B T cells
159997 986.625108 2726.911881 3366.489662 2176.967886 777.658738 3172.651040 1180.258926 2805.032132 WNV_08_B Mature neutrophils
159998 1052.918856 2361.529652 2118.016131 1828.316497 1409.880745 3205.967079 1333.772320 1949.198108 WNV_08_B Mature neutrophils
159999 1464.842892 2128.241108 1280.975755 1963.775613 2261.822362 3533.230152 1579.737874 3386.055862 WNV_08_B Immature neutrophils

160000 rows × 10 columns

Now we group by each combination of 'sample' and 'population' and calculate the median of each column, using reset_index() to bring those grouping variables back in as columns (rather than in the index).

In [ ]:
summary_df = df.groupby(['sample', 'population']).median().reset_index()
summary_df
Out[ ]:
sample population CD3e CD16-32 Ly6G CD45 CD48 CD11b B220 Ly6C
0 Mock_01_A B cells 1101.761309 1423.115437 935.259989 1749.991288 2373.619364 1277.806177 3090.182943 1206.798099
1 Mock_01_A Immature neutrophils 1165.109258 2630.754903 1411.483259 1820.910273 2056.989269 3038.914234 1307.665802 2976.948647
2 Mock_01_A Mature neutrophils 1119.270504 2642.445614 2948.457340 1941.780051 1299.029195 3192.043609 1215.742070 2404.884796
3 Mock_01_A Monocytes 1162.399840 2753.198221 751.430865 2021.853018 2791.237881 2847.182505 1265.946475 3394.221042
4 Mock_01_A Other/Unknown 1112.818261 1842.080300 953.763330 1782.565249 2322.828900 1763.119510 1259.278593 1339.180622
... ... ... ... ... ... ... ... ... ... ...
107 WNV_08_B Mature neutrophils 1099.134412 2665.096695 2591.212414 2020.342076 1531.437595 3189.083968 1255.425786 2238.279776
108 WNV_08_B Monocytes 1174.162516 2807.642119 738.106449 2446.502994 2864.042728 2842.670958 1334.504967 3286.607718
109 WNV_08_B Other/Unknown 1136.439127 1733.113755 929.854376 2038.594674 2377.648570 1812.256466 1321.196667 1508.252040
110 WNV_08_B Plasma cells 1361.025777 1241.655017 599.774053 2650.891106 2719.473533 1500.753960 3367.519860 3303.251160
111 WNV_08_B T cells 2478.055265 1287.741989 754.449005 2923.971259 2540.479341 1532.117473 1276.494220 2808.743431

112 rows × 10 columns

What we're heading towards is a DataFrame where we have one row per sample, and a separate column for each summary statistic (e.g. median_B220_B cells). The easiest way to do this is to first melt() the data into a longer format so that we can then create a new column that combines the marker name and population name. When melting the DataFrame below, we indicate the id_vars that uniquely identify each row (and aren't changed during melting), then give the name of the column that will hold the original column names (var_name) and the name of the column that will hold the original column values (value_name).

In [ ]:
melted_df = summary_df.melt(
    id_vars    = ['sample', 'population'], 
    var_name   = 'marker', 
    value_name = 'median'
)

melted_df
Out[ ]:
sample population marker median
0 Mock_01_A B cells CD3e 1101.761309
1 Mock_01_A Immature neutrophils CD3e 1165.109258
2 Mock_01_A Mature neutrophils CD3e 1119.270504
3 Mock_01_A Monocytes CD3e 1162.399840
4 Mock_01_A Other/Unknown CD3e 1112.818261
... ... ... ... ...
891 WNV_08_B Mature neutrophils Ly6C 2238.279776
892 WNV_08_B Monocytes Ly6C 3286.607718
893 WNV_08_B Other/Unknown Ly6C 1508.252040
894 WNV_08_B Plasma cells Ly6C 3303.251160
895 WNV_08_B T cells Ly6C 2808.743431

896 rows × 4 columns

Then we just overwrite the 'population' column to contain the marker and population names pasted together.

In [ ]:
melted_df['population'] = 'median_' + melted_df['marker'] + '_' + melted_df['population']
melted_df
Out[ ]:
sample population marker median
0 Mock_01_A median_CD3e_B cells CD3e 1101.761309
1 Mock_01_A median_CD3e_Immature neutrophils CD3e 1165.109258
2 Mock_01_A median_CD3e_Mature neutrophils CD3e 1119.270504
3 Mock_01_A median_CD3e_Monocytes CD3e 1162.399840
4 Mock_01_A median_CD3e_Other/Unknown CD3e 1112.818261
... ... ... ... ...
891 WNV_08_B median_Ly6C_Mature neutrophils Ly6C 2238.279776
892 WNV_08_B median_Ly6C_Monocytes Ly6C 3286.607718
893 WNV_08_B median_Ly6C_Other/Unknown Ly6C 1508.252040
894 WNV_08_B median_Ly6C_Plasma cells Ly6C 3303.251160
895 WNV_08_B median_Ly6C_T cells Ly6C 2808.743431

896 rows × 4 columns

Now we have this, we can do the opposite of melt() and pivot() the DataFrame, telling it the index that uniquely defines each row, the column that will become the new columns, and the column that contains their values. After resetting the index, we display it so you can see what we've done.

In [ ]:
med = melted_df.pivot(
    index   = 'sample', 
    columns = 'population', 
    values  = 'median'
)

med.reset_index(inplace = True)

med
Out[ ]:
population sample median_B220_B cells median_B220_Immature neutrophils median_B220_Mature neutrophils median_B220_Monocytes median_B220_Other/Unknown median_B220_Plasma cells median_B220_T cells median_CD11b_B cells median_CD11b_Immature neutrophils ... median_Ly6C_Other/Unknown median_Ly6C_Plasma cells median_Ly6C_T cells median_Ly6G_B cells median_Ly6G_Immature neutrophils median_Ly6G_Mature neutrophils median_Ly6G_Monocytes median_Ly6G_Other/Unknown median_Ly6G_Plasma cells median_Ly6G_T cells
0 Mock_01_A 3090.182943 1307.665802 1215.742070 1265.946475 1259.278593 3301.787883 1099.843622 1277.806177 3038.914234 ... 1339.180622 3104.558784 2135.911893 935.259989 1411.483259 2948.457340 751.430865 953.763330 767.930608 579.926444
1 Mock_02_A 3085.970961 1272.196495 1208.323320 1238.359503 1240.840479 3249.748679 1074.725763 1314.362718 3063.704116 ... 1362.432823 3037.105580 1499.337551 924.586287 1471.019450 2886.312281 755.706007 939.438870 637.209433 579.838249
2 Mock_03_A 3041.428383 1281.298668 1203.842502 1221.915165 1234.026814 3380.545250 1032.733042 1217.521631 3082.020307 ... 1297.972361 3005.799229 1699.243312 920.232321 1441.387585 3011.716552 750.735128 925.098875 625.057153 580.685089
3 Mock_04_A 3034.436092 1307.427494 1204.489710 1244.240153 1239.408774 3383.755132 1060.823943 1201.893696 3020.714550 ... 1322.584431 3007.110568 1707.953541 923.717498 1422.946623 3051.354286 750.476799 950.173689 573.280932 593.830408
4 Mock_05_B 3086.026846 1318.373726 1215.679054 1252.564717 1202.721633 3339.117042 1167.188129 1277.911177 3131.604786 ... 1318.134776 3191.162596 1560.061943 934.937702 1479.374802 2950.421613 717.472601 911.502331 676.084399 807.338025
5 Mock_06_B 3188.706530 1256.862809 1210.078981 1253.486391 1248.673720 3340.692055 1170.962175 1236.810068 3108.493912 ... 1364.021029 3154.266730 1498.393482 970.337384 1565.147233 3005.693770 791.220549 986.831826 634.909694 913.175047
6 Mock_07_B 3317.330232 1290.941890 1240.125675 1256.407507 1248.860084 3309.160681 1164.291024 1213.658388 3116.276716 ... 1350.879769 3105.305477 1572.118538 1033.288541 1551.196799 3038.078472 808.987159 1006.440278 736.455393 933.515144
7 Mock_08_B 3049.365297 1384.366367 1212.416547 1218.975442 1217.381433 3025.927264 1097.460705 1261.309565 3142.063365 ... 1359.002096 3129.048865 2609.474726 973.210710 1408.020458 3041.685610 761.469904 1004.248264 792.685201 781.961909
8 WNV_01_A 3428.948815 1370.059547 1211.098785 1306.238835 1293.896179 3346.580922 1076.180702 1283.446672 3182.772890 ... 1515.006429 3111.122996 2704.455244 849.190544 1366.928840 2826.922345 692.327427 930.563518 668.818900 507.506607
9 WNV_02_A 3358.582495 1325.453077 1210.412213 1297.398187 1286.650627 3331.798330 1080.870984 1278.262997 3100.696801 ... 1542.863123 3137.276400 2835.085762 859.054324 1349.238459 2751.546040 711.855131 913.830234 672.286197 476.394896
10 WNV_03_A 3232.116295 1344.004772 1236.760987 1337.517765 1323.466632 3324.163809 1126.636678 1262.099822 3041.301944 ... 1549.755835 3150.929616 2937.027799 891.618697 1363.768406 2584.591178 713.370671 911.547972 675.465829 388.150655
11 WNV_04_A 3113.307588 1321.082754 1226.724491 1283.410313 1311.792947 3309.123193 1094.183150 1232.483117 3138.744220 ... 1510.893847 3188.097947 2833.182465 861.382952 1433.205478 2858.445502 760.760886 913.774101 671.511543 540.268419
12 WNV_05_B 3507.954731 1352.112915 1243.638338 1333.292239 1324.828792 3344.040171 1288.899325 1175.217679 3294.425312 ... 1519.219566 3234.496025 2935.956810 901.263874 1316.214670 2627.029764 782.437117 957.514527 737.582369 789.656096
13 WNV_06_B 3460.297173 1394.089066 1235.050305 1346.164972 1316.592450 3432.972806 1218.058724 1200.619376 3053.456469 ... 1535.759098 3319.703598 2554.642605 946.795196 1360.411652 2778.339594 786.406039 942.652115 725.059139 843.327820
14 WNV_07_B 3407.560131 1420.766418 1263.141424 1370.854252 1343.314896 3418.642050 1273.326803 1231.838057 3172.606370 ... 1487.360275 3352.188993 2824.607390 935.389096 1390.926173 2735.123015 787.431197 939.634776 783.109349 813.413166
15 WNV_08_B 3532.530264 1364.843016 1255.425786 1334.504967 1321.196667 3367.519860 1276.494220 1180.057855 3186.802177 ... 1508.252040 3303.251160 2808.743431 895.629178 1408.825017 2591.212414 738.106449 929.854376 599.774053 754.449005

16 rows × 57 columns

Phew! Nearly there, now we just merge the percentage statistics together with the median intensity statistics, into one DataFrame called stats.

In [ ]:
stats = per.merge(med)
stats
Out[ ]:
sample percent_B cells percent_Immature neutrophils percent_Mature neutrophils percent_Monocytes percent_Other/Unknown percent_Plasma cells percent_T cells median_B220_B cells median_B220_Immature neutrophils ... median_Ly6C_Other/Unknown median_Ly6C_Plasma cells median_Ly6C_T cells median_Ly6G_B cells median_Ly6G_Immature neutrophils median_Ly6G_Mature neutrophils median_Ly6G_Monocytes median_Ly6G_Other/Unknown median_Ly6G_Plasma cells median_Ly6G_T cells
0 Mock_01_A 29.87 1.50 43.60 12.53 6.76 1.51 4.23 3090.182943 1307.665802 ... 1339.180622 3104.558784 2135.911893 935.259989 1411.483259 2948.457340 751.430865 953.763330 767.930608 579.926444
1 Mock_02_A 31.57 1.82 42.13 11.51 7.65 1.34 3.98 3085.970961 1272.196495 ... 1362.432823 3037.105580 1499.337551 924.586287 1471.019450 2886.312281 755.706007 939.438870 637.209433 579.838249
2 Mock_03_A 31.82 1.90 40.36 13.22 7.54 1.31 3.85 3041.428383 1281.298668 ... 1297.972361 3005.799229 1699.243312 920.232321 1441.387585 3011.716552 750.735128 925.098875 625.057153 580.685089
3 Mock_04_A 31.60 2.18 41.10 13.27 7.00 1.42 3.43 3034.436092 1307.427494 ... 1322.584431 3007.110568 1707.953541 923.717498 1422.946623 3051.354286 750.476799 950.173689 573.280932 593.830408
4 Mock_05_B 30.42 2.12 44.06 12.31 6.87 1.06 3.16 3086.026846 1318.373726 ... 1318.134776 3191.162596 1560.061943 934.937702 1479.374802 2950.421613 717.472601 911.502331 676.084399 807.338025
5 Mock_06_B 31.08 1.64 43.07 12.93 6.51 1.14 3.63 3188.706530 1256.862809 ... 1364.021029 3154.266730 1498.393482 970.337384 1565.147233 3005.693770 791.220549 986.831826 634.909694 913.175047
6 Mock_07_B 29.99 1.92 42.56 12.22 6.58 1.00 5.73 3317.330232 1290.941890 ... 1350.879769 3105.305477 1572.118538 1033.288541 1551.196799 3038.078472 808.987159 1006.440278 736.455393 933.515144
7 Mock_08_B 29.85 3.04 43.87 14.88 5.97 0.35 2.04 3049.365297 1384.366367 ... 1359.002096 3129.048865 2609.474726 973.210710 1408.020458 3041.685610 761.469904 1004.248264 792.685201 781.961909
8 WNV_01_A 16.26 4.01 35.93 24.68 7.76 3.81 7.55 3428.948815 1370.059547 ... 1515.006429 3111.122996 2704.455244 849.190544 1366.928840 2826.922345 692.327427 930.563518 668.818900 507.506607
9 WNV_02_A 16.30 4.75 33.96 26.21 7.96 4.08 6.74 3358.582495 1325.453077 ... 1542.863123 3137.276400 2835.085762 859.054324 1349.238459 2751.546040 711.855131 913.830234 672.286197 476.394896
10 WNV_03_A 23.64 4.71 29.93 24.62 7.09 4.22 5.79 3232.116295 1344.004772 ... 1549.755835 3150.929616 2937.027799 891.618697 1363.768406 2584.591178 713.370671 911.547972 675.465829 388.150655
11 WNV_04_A 17.66 4.58 39.70 19.59 6.25 4.12 8.10 3113.307588 1321.082754 ... 1510.893847 3188.097947 2833.182465 861.382952 1433.205478 2858.445502 760.760886 913.774101 671.511543 540.268419
12 WNV_05_B 14.68 6.69 35.84 26.16 7.81 2.44 6.38 3507.954731 1352.112915 ... 1519.219566 3234.496025 2935.956810 901.263874 1316.214670 2627.029764 782.437117 957.514527 737.582369 789.656096
13 WNV_06_B 22.34 4.61 35.54 18.67 7.53 2.99 8.32 3460.297173 1394.089066 ... 1535.759098 3319.703598 2554.642605 946.795196 1360.411652 2778.339594 786.406039 942.652115 725.059139 843.327820
14 WNV_07_B 18.21 5.12 36.67 20.97 7.36 3.31 8.36 3407.560131 1420.766418 ... 1487.360275 3352.188993 2824.607390 935.389096 1390.926173 2735.123015 787.431197 939.634776 783.109349 813.413166
15 WNV_08_B 19.88 5.74 34.35 21.94 7.40 2.94 7.75 3532.530264 1364.843016 ... 1508.252040 3303.251160 2808.743431 895.629178 1408.825017 2591.212414 738.106449 929.854376 599.774053 754.449005

16 rows × 64 columns

Plotting heatmap of summary statistics¶

One option to visualise the summary statistics we just generated is to use a heatmap. With a large number of antigens and/or populations, this heatmap can grow large and hard to read, but we’ll generate one anyway and see how we can filter it down to more interesting variables.

It can be useful to use a diverging colour palette to highlight differences in the data. There are built-in palettes, but we illustrate how to create your own below.

In [ ]:
my_pal = sns.diverging_palette(
    h_neg  = 250, 
    h_pos  = 10, 
    s      = 99, 
    l      = 50, 
    sep    = 1, 
    n      = 50, 
    center = "dark"
)

To get a feel for what these arguments control, you can experiment with the choose_diverging_palette() function.

In [ ]:
sns.choose_diverging_palette()
interactive(children=(IntSlider(value=220, description='h_neg', max=359), IntSlider(value=10, description='h_p…
Out[ ]:
[(0.2519971417644415, 0.4987337088076726, 0.5751602783606602),
 (0.43156001218774975, 0.6160490836499025, 0.6735874169971766),
 (0.611122882611058, 0.7333644584921324, 0.7720145556336929),
 (0.7906857530343663, 0.8506798333343624, 0.8704416942702093),
 (0.95, 0.95, 0.95),
 (0.9282549678814984, 0.7863704363662967, 0.7963965173228867),
 (0.9022582584936525, 0.6005186021022944, 0.622400049291663),
 (0.8762615491058064, 0.4146667678382919, 0.44840358126043944),
 (0.8510408608937171, 0.23436274952246883, 0.2796010376480583)]

It's useful to add columns of row annotations to the heatmap, which we can do by generating a DataFrame with colours we want mapped to each row.

In [ ]:
lut1 = {'Mock': '#E41A1C', 'Virus': '#377EB8'}
row_colors1 = metadata['group'].map(lut1)

lut2 = {1: '#4DAF4A', 2: '#984EA3'}
row_colors2 = metadata['batch'].map(lut2)

row_colors = pd.concat([row_colors1, row_colors2], axis = 1)

row_colors.index = metadata['sample']
row_colors
Out[ ]:
group batch
sample
Mock_01_A #E41A1C #4DAF4A
Mock_02_A #E41A1C #4DAF4A
Mock_03_A #E41A1C #4DAF4A
Mock_04_A #E41A1C #4DAF4A
WNV_01_A #377EB8 #4DAF4A
WNV_02_A #377EB8 #4DAF4A
WNV_03_A #377EB8 #4DAF4A
WNV_04_A #377EB8 #4DAF4A
Mock_05_B #E41A1C #984EA3
Mock_06_B #E41A1C #984EA3
Mock_07_B #E41A1C #984EA3
Mock_08_B #E41A1C #984EA3
WNV_05_B #377EB8 #984EA3
WNV_06_B #377EB8 #984EA3
WNV_07_B #377EB8 #984EA3
WNV_08_B #377EB8 #984EA3

Now we turn our 'sample' column into the index.

In [ ]:
stats = stats.set_index('sample')

And plot the heatmap, using z_score = 1 to indicate we wish the data to be z-scored within each column (variable). Notice that our row annotations show the two groups of samples cluster separately, but that there is still a batch effect present in the data.

In [ ]:
sns.clustermap(
    stats,
    z_score    = 1,
    cmap       = my_pal,
    row_colors = row_colors,
    figsize    = (16, 8),
    cbar_pos   = (0.08, 0.12, 0.03, 0.2),
    cbar_kws   = {'label' : 'Scaled expression'}
)
Out[ ]:
<seaborn.matrix.ClusterGrid at 0x16b1d6b2800>
No description has been provided for this image

Principal component analysis of samples¶

We can use PCA to see if our groups of samples partition separately from each other when projected onto the first two principal components. Because we have variables on different scales (percentages and MFIs), it's important we first compute z-scores for each of our variables.

In [ ]:
stats_zscore = stats.apply(scipy.stats.zscore)
stats_zscore
Out[ ]:
percent_B cells percent_Immature neutrophils percent_Mature neutrophils percent_Monocytes percent_Other/Unknown percent_Plasma cells percent_T cells median_B220_B cells median_B220_Immature neutrophils median_B220_Mature neutrophils ... median_Ly6C_Other/Unknown median_Ly6C_Plasma cells median_Ly6C_T cells median_Ly6G_B cells median_Ly6G_Immature neutrophils median_Ly6G_Mature neutrophils median_Ly6G_Monocytes median_Ly6G_Other/Unknown median_Ly6G_Plasma cells median_Ly6G_T cells
sample
Mock_01_A 0.802854 -1.236989 1.115000 -0.985028 -0.638389 -0.630948 -0.655261 -0.884702 -0.538677 -0.484373 ... -0.975356 -0.532374 -0.275018 0.284556 -0.146172 0.588752 -0.072959 0.298825 1.294187 -0.618597
Mock_02_A 1.066753 -1.041091 0.765010 -1.173643 0.907641 -0.764192 -0.777969 -0.908629 -1.324767 -0.891918 ... -0.726207 -1.201652 -1.377121 0.051218 0.744326 0.195435 0.060836 -0.178963 -0.772626 -0.619141
Mock_03_A 1.105562 -0.992117 0.343592 -0.857435 0.716559 -0.787705 -0.841777 -1.161661 -1.123040 -1.138068 ... -1.416904 -1.512277 -1.031023 -0.043964 0.301114 0.989122 -0.094733 -0.657269 -0.964764 -0.613917
Mock_04_A 1.071410 -0.820706 0.519778 -0.848189 -0.221482 -0.701489 -1.047926 -1.201382 -0.543959 -1.102514 ... -1.153184 -1.499266 -1.015943 0.032225 0.025288 1.239990 -0.102817 0.179094 -1.783390 -0.532834
Mock_05_B 0.888233 -0.857437 1.224521 -1.025709 -0.447306 -0.983652 -1.180451 -0.908311 -0.301362 -0.487835 ... -1.200863 0.326919 -1.271988 0.277511 0.869299 0.601184 -1.135716 -1.110778 -0.157980 0.784119
Mock_06_B 0.990688 -1.151284 0.988813 -0.911061 -1.072667 -0.920949 -0.949760 -0.325020 -1.664601 -0.795472 ... -0.709190 -0.039166 -1.378755 1.051384 2.152220 0.951003 1.172298 1.401815 -0.808987 1.436940
Mock_07_B 0.821483 -0.979873 0.867388 -1.042352 -0.951069 -1.030679 0.080987 0.405650 -0.909322 0.855124 ... -0.849999 -0.524965 -1.251115 2.427560 1.943560 1.155967 1.728321 2.055849 0.796537 1.562402
Mock_08_B 0.799750 -0.294230 1.179284 -0.550474 -2.010707 -1.540141 -1.730182 -1.116574 1.161202 -0.667059 ... -0.762968 -0.289380 0.544862 1.114198 -0.197966 1.178796 0.241223 1.982735 1.685578 0.627594
WNV_01_A -1.309890 0.299586 -0.711142 1.261709 1.098723 1.171761 0.974301 1.039720 0.844127 -0.739449 ... 0.908625 -0.467243 0.709302 -1.597010 -0.812585 -0.180445 -1.922658 -0.474998 -0.272854 -1.065295
WNV_02_A -1.303680 0.752601 -1.180177 1.544631 1.446146 1.383383 0.576727 0.639991 -0.144466 -0.777166 ... 1.207111 -0.207746 0.935463 -1.381378 -1.077184 -0.657504 -1.311520 -1.033131 -0.218033 -1.257198
WNV_03_A -0.164258 0.728113 -2.139675 1.250614 -0.065142 1.493113 0.110437 -0.078423 0.266687 0.670287 ... 1.280967 -0.072277 1.111956 -0.669487 -0.859856 -1.714165 -1.264090 -1.109256 -0.167760 -1.801504
WNV_04_A -1.092561 0.648530 0.186453 0.320484 -1.524316 1.414735 1.244259 -0.753338 -0.241323 0.118939 ... 0.864559 0.296512 0.932168 -1.330471 0.178733 0.019065 0.219033 -1.035004 -0.230281 -0.863214
WNV_05_B -1.555160 1.940233 -0.732570 1.535386 1.185579 0.097973 0.400028 1.488527 0.446384 1.048090 ... 0.953769 0.756879 1.110101 -0.458634 -1.571130 -1.445570 0.897412 0.423945 0.814356 0.675053
WNV_06_B -0.366063 0.666895 -0.803997 0.150360 0.699188 0.529056 1.352242 1.217800 1.376682 0.576312 ... 1.130991 1.602318 0.449931 0.536728 -0.910064 -0.487927 1.021623 -0.071786 0.616353 1.006110
WNV_07_B -1.007182 0.979108 -0.534956 0.575668 0.403879 0.779868 1.371875 0.918218 1.967920 2.119479 ... 0.612396 1.924642 0.917322 0.287379 -0.453650 -0.761445 1.053706 -0.172429 1.534176 0.821591
WNV_08_B -0.747940 1.358660 -1.087322 0.755038 0.473363 0.489867 1.072468 1.628133 0.728515 1.695625 ... 0.836252 1.439075 0.889857 -0.581814 -0.185932 -1.672259 -0.489959 -0.498651 -1.364512 0.457890

16 rows × 63 columns

We specify the number of components we want to be returned (just 2 for plotting) using the PCA() function, then actually compute the projections using the fit_transform() method. This gives us an array with one row per sample, and a column for each principal component projection.

In [ ]:
pca = PCA(n_components = 2)
stats_pca = pca.fit_transform(stats_zscore)
stats_pca
Out[ ]:
array([[-4.93095277, -0.57487011],
       [-6.37596216, -1.27243114],
       [-5.32943365, -3.3606912 ],
       [-4.86278672, -3.58585446],
       [-4.83576239,  1.35172841],
       [-5.2420011 ,  3.17651377],
       [-4.39018757,  5.36285259],
       [-3.9788662 ,  2.25648044],
       [ 3.62261647, -3.90003197],
       [ 4.47680926, -4.23085216],
       [ 4.37420167, -4.21502922],
       [ 2.91086004, -3.21233533],
       [ 7.70780641,  1.72354253],
       [ 4.18458799,  4.41752128],
       [ 6.39532467,  3.79089147],
       [ 6.27374605,  2.27256511]])

We can add these as columns to our stats DataFrame.

In [ ]:
stats['PCA1'] = stats_pca[:, 0]
stats['PCA2'] = stats_pca[:, 1]
stats
Out[ ]:
percent_B cells percent_Immature neutrophils percent_Mature neutrophils percent_Monocytes percent_Other/Unknown percent_Plasma cells percent_T cells median_B220_B cells median_B220_Immature neutrophils median_B220_Mature neutrophils ... median_Ly6C_T cells median_Ly6G_B cells median_Ly6G_Immature neutrophils median_Ly6G_Mature neutrophils median_Ly6G_Monocytes median_Ly6G_Other/Unknown median_Ly6G_Plasma cells median_Ly6G_T cells PCA1 PCA2
sample
Mock_01_A 29.87 1.50 43.60 12.53 6.76 1.51 4.23 3090.182943 1307.665802 1215.742070 ... 2135.911893 935.259989 1411.483259 2948.457340 751.430865 953.763330 767.930608 579.926444 -4.930953 -0.574870
Mock_02_A 31.57 1.82 42.13 11.51 7.65 1.34 3.98 3085.970961 1272.196495 1208.323320 ... 1499.337551 924.586287 1471.019450 2886.312281 755.706007 939.438870 637.209433 579.838249 -6.375962 -1.272431
Mock_03_A 31.82 1.90 40.36 13.22 7.54 1.31 3.85 3041.428383 1281.298668 1203.842502 ... 1699.243312 920.232321 1441.387585 3011.716552 750.735128 925.098875 625.057153 580.685089 -5.329434 -3.360691
Mock_04_A 31.60 2.18 41.10 13.27 7.00 1.42 3.43 3034.436092 1307.427494 1204.489710 ... 1707.953541 923.717498 1422.946623 3051.354286 750.476799 950.173689 573.280932 593.830408 -4.862787 -3.585854
Mock_05_B 30.42 2.12 44.06 12.31 6.87 1.06 3.16 3086.026846 1318.373726 1215.679054 ... 1560.061943 934.937702 1479.374802 2950.421613 717.472601 911.502331 676.084399 807.338025 -4.835762 1.351728
Mock_06_B 31.08 1.64 43.07 12.93 6.51 1.14 3.63 3188.706530 1256.862809 1210.078981 ... 1498.393482 970.337384 1565.147233 3005.693770 791.220549 986.831826 634.909694 913.175047 -5.242001 3.176514
Mock_07_B 29.99 1.92 42.56 12.22 6.58 1.00 5.73 3317.330232 1290.941890 1240.125675 ... 1572.118538 1033.288541 1551.196799 3038.078472 808.987159 1006.440278 736.455393 933.515144 -4.390188 5.362853
Mock_08_B 29.85 3.04 43.87 14.88 5.97 0.35 2.04 3049.365297 1384.366367 1212.416547 ... 2609.474726 973.210710 1408.020458 3041.685610 761.469904 1004.248264 792.685201 781.961909 -3.978866 2.256480
WNV_01_A 16.26 4.01 35.93 24.68 7.76 3.81 7.55 3428.948815 1370.059547 1211.098785 ... 2704.455244 849.190544 1366.928840 2826.922345 692.327427 930.563518 668.818900 507.506607 3.622616 -3.900032
WNV_02_A 16.30 4.75 33.96 26.21 7.96 4.08 6.74 3358.582495 1325.453077 1210.412213 ... 2835.085762 859.054324 1349.238459 2751.546040 711.855131 913.830234 672.286197 476.394896 4.476809 -4.230852
WNV_03_A 23.64 4.71 29.93 24.62 7.09 4.22 5.79 3232.116295 1344.004772 1236.760987 ... 2937.027799 891.618697 1363.768406 2584.591178 713.370671 911.547972 675.465829 388.150655 4.374202 -4.215029
WNV_04_A 17.66 4.58 39.70 19.59 6.25 4.12 8.10 3113.307588 1321.082754 1226.724491 ... 2833.182465 861.382952 1433.205478 2858.445502 760.760886 913.774101 671.511543 540.268419 2.910860 -3.212335
WNV_05_B 14.68 6.69 35.84 26.16 7.81 2.44 6.38 3507.954731 1352.112915 1243.638338 ... 2935.956810 901.263874 1316.214670 2627.029764 782.437117 957.514527 737.582369 789.656096 7.707806 1.723543
WNV_06_B 22.34 4.61 35.54 18.67 7.53 2.99 8.32 3460.297173 1394.089066 1235.050305 ... 2554.642605 946.795196 1360.411652 2778.339594 786.406039 942.652115 725.059139 843.327820 4.184588 4.417521
WNV_07_B 18.21 5.12 36.67 20.97 7.36 3.31 8.36 3407.560131 1420.766418 1263.141424 ... 2824.607390 935.389096 1390.926173 2735.123015 787.431197 939.634776 783.109349 813.413166 6.395325 3.790891
WNV_08_B 19.88 5.74 34.35 21.94 7.40 2.94 7.75 3532.530264 1364.843016 1255.425786 ... 2808.743431 895.629178 1408.825017 2591.212414 738.106449 929.854376 599.774053 754.449005 6.273746 2.272565

16 rows × 65 columns

So that we can plot the principal component projections, lets add the sample metadata to stats (by merging on a common index of the sample column).

In [ ]:
metadata.set_index('sample', inplace = True)
stats = stats.join(metadata)
stats
Out[ ]:
percent_B cells percent_Immature neutrophils percent_Mature neutrophils percent_Monocytes percent_Other/Unknown percent_Plasma cells percent_T cells median_B220_B cells median_B220_Immature neutrophils median_B220_Mature neutrophils ... median_Ly6G_Other/Unknown median_Ly6G_Plasma cells median_Ly6G_T cells PCA1 PCA2 filename group batch_label batch reference
sample
Mock_01_A 29.87 1.50 43.60 12.53 6.76 1.51 4.23 3090.182943 1307.665802 1215.742070 ... 953.763330 767.930608 579.926444 -4.930953 -0.574870 Mock_01_A.fcs Mock A 1 ref
Mock_02_A 31.57 1.82 42.13 11.51 7.65 1.34 3.98 3085.970961 1272.196495 1208.323320 ... 939.438870 637.209433 579.838249 -6.375962 -1.272431 Mock_02_A.fcs Mock A 1 other
Mock_03_A 31.82 1.90 40.36 13.22 7.54 1.31 3.85 3041.428383 1281.298668 1203.842502 ... 925.098875 625.057153 580.685089 -5.329434 -3.360691 Mock_03_A.fcs Mock A 1 other
Mock_04_A 31.60 2.18 41.10 13.27 7.00 1.42 3.43 3034.436092 1307.427494 1204.489710 ... 950.173689 573.280932 593.830408 -4.862787 -3.585854 Mock_04_A.fcs Mock A 1 other
Mock_05_B 30.42 2.12 44.06 12.31 6.87 1.06 3.16 3086.026846 1318.373726 1215.679054 ... 911.502331 676.084399 807.338025 -4.835762 1.351728 Mock_05_B.fcs Mock B 2 ref
Mock_06_B 31.08 1.64 43.07 12.93 6.51 1.14 3.63 3188.706530 1256.862809 1210.078981 ... 986.831826 634.909694 913.175047 -5.242001 3.176514 Mock_06_B.fcs Mock B 2 other
Mock_07_B 29.99 1.92 42.56 12.22 6.58 1.00 5.73 3317.330232 1290.941890 1240.125675 ... 1006.440278 736.455393 933.515144 -4.390188 5.362853 Mock_07_B.fcs Mock B 2 other
Mock_08_B 29.85 3.04 43.87 14.88 5.97 0.35 2.04 3049.365297 1384.366367 1212.416547 ... 1004.248264 792.685201 781.961909 -3.978866 2.256480 Mock_08_B.fcs Mock B 2 other
WNV_01_A 16.26 4.01 35.93 24.68 7.76 3.81 7.55 3428.948815 1370.059547 1211.098785 ... 930.563518 668.818900 507.506607 3.622616 -3.900032 Virus_01_A.fcs Virus A 1 other
WNV_02_A 16.30 4.75 33.96 26.21 7.96 4.08 6.74 3358.582495 1325.453077 1210.412213 ... 913.830234 672.286197 476.394896 4.476809 -4.230852 Virus_02_A.fcs Virus A 1 other
WNV_03_A 23.64 4.71 29.93 24.62 7.09 4.22 5.79 3232.116295 1344.004772 1236.760987 ... 911.547972 675.465829 388.150655 4.374202 -4.215029 Virus_03_A.fcs Virus A 1 other
WNV_04_A 17.66 4.58 39.70 19.59 6.25 4.12 8.10 3113.307588 1321.082754 1226.724491 ... 913.774101 671.511543 540.268419 2.910860 -3.212335 Virus_04_A.fcs Virus A 1 other
WNV_05_B 14.68 6.69 35.84 26.16 7.81 2.44 6.38 3507.954731 1352.112915 1243.638338 ... 957.514527 737.582369 789.656096 7.707806 1.723543 Virus_05_B.fcs Virus B 2 other
WNV_06_B 22.34 4.61 35.54 18.67 7.53 2.99 8.32 3460.297173 1394.089066 1235.050305 ... 942.652115 725.059139 843.327820 4.184588 4.417521 Virus_06_B.fcs Virus B 2 other
WNV_07_B 18.21 5.12 36.67 20.97 7.36 3.31 8.36 3407.560131 1420.766418 1263.141424 ... 939.634776 783.109349 813.413166 6.395325 3.790891 Virus_07_B.fcs Virus B 2 other
WNV_08_B 19.88 5.74 34.35 21.94 7.40 2.94 7.75 3532.530264 1364.843016 1255.425786 ... 929.854376 599.774053 754.449005 6.273746 2.272565 Virus_08_B.fcs Virus B 2 other

16 rows × 70 columns

And now we visualize the projections, colouring by group and using different shapes for the batches. It seems as though PCA1 largely captures the differences between groups (which seem substantial), while PCA2 largely captures the remaining batch effect.

In [ ]:
sns.scatterplot(
    stats,
    x     = 'PCA1',
    y     = 'PCA2',
    hue   = 'group',
    style = 'batch'
)
Out[ ]:
<Axes: xlabel='PCA1', ylabel='PCA2'>
No description has been provided for this image

Pairwise comparisons¶

To identify summary statistics that are unlikely to be the same across our groups, we can perform pairwise Mann-Whitney U tests (non-parametric equivalent of t test) and compute fold changes and false discovery rate (FDR)-corrected p values. There are more sophisitcated things we can do (especially to control for that batch effect), but this will do as a first pass.

Let's start by splitting the DataFrame into one corresponding to the mock samples, and another corresponding to the virus samples.

In [ ]:
mock  = stats[stats['group'] == 'Mock']
virus = stats[stats['group'] == 'Virus']

Then, we iterate over each statistic, perform a Mann-Whitney U test, calculate a log2 fold change of medians, and return a dictionary of fold change, log2 fold change, and p-value. We convert this into a DataFrame, and add a False Discovery Rate-corrected p-value, its -log10 value (for creating a volcano plot), and a column that we can colour by. There's quite a lot going on below but take it one step at a time and you should be able to follow it.

In [ ]:
results = {}

for column in stats.columns[:63]:
    u_statistic, p_value = scipy.stats.mannwhitneyu(mock[column], virus[column])
    fc = virus[column].median() / mock[column].median()
    log2fc = np.log2(fc)

    results[column] = {
        'FC'         : fc,
        'Log2FC'     : log2fc,
        'p-value'    : p_value
    }

stats_tab = pd.DataFrame(results).transpose()
stats_tab['FDR_p-value'] = scipy.stats.false_discovery_control(stats_tab['p-value'])
stats_tab['neg_log10_FDR_p-value'] = -np.log10(stats_tab['FDR_p-value'])
stats_tab['Sig'] = ['p > 0.05' if p > 0.05 else 'p < 0.05' for p in stats_tab['FDR_p-value']]
stats_tab
Out[ ]:
FC Log2FC p-value FDR_p-value neg_log10_FDR_p-value Sig
percent_B cells 0.583252 -0.777809 0.000155 0.000576 3.239657 p < 0.05
percent_Immature neutrophils 2.476440 1.308268 0.000155 0.000576 3.239657 p < 0.05
percent_Mature neutrophils 0.833586 -0.262596 0.000155 0.000576 3.239657 p < 0.05
percent_Monocytes 1.828751 0.870859 0.000155 0.000576 3.239657 p < 0.05
percent_Other/Unknown 1.095378 0.131429 0.082984 0.137578 0.861450 p > 0.05
... ... ... ... ... ... ...
median_Ly6G_Mature neutrophils 0.911799 -0.133212 0.000155 0.000576 3.239657 p < 0.05
median_Ly6G_Monocytes 0.994513 -0.007938 0.441803 0.545756 0.263001 p > 0.05
median_Ly6G_Other/Unknown 0.977143 -0.033359 0.160528 0.229847 0.638560 p > 0.05
median_Ly6G_Plasma cells 1.026238 0.037365 0.798446 0.852578 0.069266 p > 0.05
median_Ly6G_T cells 0.941070 -0.087625 0.278632 0.373486 0.427726 p > 0.05

63 rows × 6 columns

We can visualise the log2 fold change against the p values using a volcano plot.

In [ ]:
p = sns.scatterplot(
    stats_tab,
    x   = 'Log2FC',
    y   = 'neg_log10_FDR_p-value',
    hue = 'Sig'
)

plt.xlim(-1.6, 1.6)
plt.axhline(y = 1.3) 
plt.axvline(x = -np.log2(1.03)) 
plt.axvline(x = np.log2(1.03)) 
Out[ ]:
<matplotlib.lines.Line2D at 0x16b25e9f9a0>
No description has been provided for this image

Filtering comparisons of interest¶

Let’s focus in on comparisons with fold changes and p values above certain thresholds. Let’s define any feature having a (FDR-corrected) p value smaller than 0.05, and having a log2-fold change value >= 0.04 (this corresponds to a fold change of ~1.02), as of interest. I've set the bar very low but you can customise this.

In [ ]:
fold_filter    = abs(stats_tab['Log2FC']) >= np.log2(0.04)
p_filter       = stats_tab['FDR_p-value'] < 0.05
stats_filtered = stats_tab[fold_filter & p_filter]
stats_filtered
Out[ ]:
FC Log2FC p-value FDR_p-value neg_log10_FDR_p-value Sig
percent_B cells 0.583252 -0.777809 0.000155 0.000576 3.239657 p < 0.05
percent_Immature neutrophils 2.476440 1.308268 0.000155 0.000576 3.239657 p < 0.05
percent_Mature neutrophils 0.833586 -0.262596 0.000155 0.000576 3.239657 p < 0.05
percent_Monocytes 1.828751 0.870859 0.000155 0.000576 3.239657 p < 0.05
percent_Plasma cells 2.906122 1.539095 0.000155 0.000576 3.239657 p < 0.05
percent_T cells 2.045455 1.032421 0.000155 0.000576 3.239657 p < 0.05
median_B220_B cells 1.107665 0.147522 0.001088 0.003263 2.486329 p < 0.05
median_B220_Immature neutrophils 1.045639 0.064385 0.004662 0.011748 1.930027 p < 0.05
median_B220_Monocytes 1.068484 0.095566 0.000155 0.000576 3.239657 p < 0.05
median_B220_Other/Unknown 1.063518 0.088844 0.000155 0.000576 3.239657 p < 0.05
median_CD11b_T cells 1.067611 0.094385 0.001865 0.004895 2.310238 p < 0.05
median_CD3e_B cells 1.020668 0.029514 0.000622 0.001958 2.708178 p < 0.05
median_CD3e_Mature neutrophils 0.990726 -0.013443 0.014763 0.032071 1.493882 p < 0.05
median_CD3e_Monocytes 1.029150 0.041453 0.000155 0.000576 3.239657 p < 0.05
median_CD3e_Other/Unknown 1.018341 0.026221 0.014763 0.032071 1.493882 p < 0.05
median_CD45_B cells 1.367670 0.451720 0.000155 0.000576 3.239657 p < 0.05
median_CD45_Immature neutrophils 1.149965 0.201590 0.000155 0.000576 3.239657 p < 0.05
median_CD45_Monocytes 1.166929 0.222717 0.000155 0.000576 3.239657 p < 0.05
median_CD45_Other/Unknown 1.132105 0.179008 0.000155 0.000576 3.239657 p < 0.05
median_CD45_T cells 1.025594 0.036460 0.020668 0.040691 1.390506 p < 0.05
median_CD48_Immature neutrophils 1.068819 0.096018 0.014763 0.032071 1.493882 p < 0.05
median_CD48_Mature neutrophils 1.147134 0.198035 0.000155 0.000576 3.239657 p < 0.05
median_CD48_Plasma cells 0.961270 -0.056986 0.001865 0.004895 2.310238 p < 0.05
median_Ly6C_B cells 1.181472 0.240586 0.000155 0.000576 3.239657 p < 0.05
median_Ly6C_Mature neutrophils 0.931586 -0.102240 0.000311 0.001031 2.986932 p < 0.05
median_Ly6C_Monocytes 0.967948 -0.046999 0.020668 0.040691 1.390506 p < 0.05
median_Ly6C_Other/Unknown 1.127940 0.173690 0.000155 0.000576 3.239657 p < 0.05
median_Ly6C_Plasma cells 1.034257 0.048594 0.010412 0.025229 1.598107 p < 0.05
median_Ly6C_T cells 1.729491 0.790347 0.000311 0.001031 2.986932 p < 0.05
median_Ly6G_B cells 0.955646 -0.065451 0.020668 0.040691 1.390506 p < 0.05
median_Ly6G_Immature neutrophils 0.937608 -0.092943 0.001865 0.004895 2.310238 p < 0.05
median_Ly6G_Mature neutrophils 0.911799 -0.133212 0.000155 0.000576 3.239657 p < 0.05

Next, we use the index of features that meet these criteria and re plot our heatmap again, but this time we only include the filtered features based on our statistical analysis.

In [ ]:
sns.clustermap(
    stats[stats_filtered.index],
    z_score    = 1,
    cmap       = my_pal,
    row_colors = row_colors,
    figsize    = (16, 8),
    cbar_pos   = (0.08, 0.12, 0.03, 0.2),
    cbar_kws   = {'label' : 'Scaled expression'}
)
Out[ ]:
<seaborn.matrix.ClusterGrid at 0x16b23372170>
No description has been provided for this image

For individual plots comparing the groups per statistic, we create boxplots.

In [ ]:
for stat in stats[stats_filtered.index].columns:
    sns.catplot(
        data = stats, 
        x    = 'group', 
        y    = stat, 
        hue  = 'group',
        kind = 'box'
    )
c:\Python_3.10.2\lib\site-packages\seaborn\axisgrid.py:447: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig = plt.figure(figsize=figsize)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image